Objetivo¶
El objetivo principal de esta práctica es emplear técnicas de Análisis Exploratorio de Datos (EDA) para comprender a fondo un conjunto de datos relacionado con solicitudes de préstamos. Este análisis busca identificar patrones y relaciones entre las variables que permitan predecir el comportamiento de los clientes respecto al cumplimiento o incumplimiento de sus pagos. En este sentido, se evaluarán tanto variables demográficas como características financieras de los solicitantes, para determinar su influencia en la capacidad de devolver un préstamo. Este objetivo incluye el análisis detallado de posibles correlaciones entre las variables, la imputación de valores nulos, la transformación de datos categóricos y numéricos, y el escalado de las características según sea necesario.
En última instancia, el propósito es responder a la pregunta clave de negocio planteada: "¿Hay algún tipo de cliente más propenso a no devolver un préstamo?". Esto servirá como base para proponer estrategias que puedan mejorar la evaluación de riesgo y reducir las tasas de incumplimiento en futuras decisiones de crédito.
Pasos¶
- Exploración general de los datos: se analizan la distribución de los datos y las relaciones entre las distintas variables en relación a la variable objetivo.
- Correlaciones, valores nulos y outliers: se evaluan las correlaciones entre variables, se tratan los valores faltantes y los outliers. Se elaboran estrategias para su eliminación o tratamiento.
- Codificación y escalado: se codifican las variables categóricas para poderlas emplearlas en el modelo. Se implementan las técnicas de OneHotEncoding y TargetEncoding en función de los valores únicos de cada variable.
Caso de negocio¶
El caso de negocio gira en torno a un banco que se especializa en ofrecer distintos tipos de préstamos a sus clientes. Actualmente, la institución enfrenta el desafío de identificar a los solicitantes que representan un mayor riesgo de incumplimiento. Para abordar este problema, se busca analizar el comportamiento histórico de los clientes que no han cumplido con sus pagos y establecer criterios claros para mitigar los riesgos en futuras operaciones de préstamo.
El análisis exploratorio se convierte, en este contexto, en una herramienta clave para segmentar a los clientes según sus características y establecer perfiles de riesgo más precisos. Al responder preguntas como qué tipo de clientes o condiciones financieras están más relacionadas con el incumplimiento, el banco puede mejorar sus procesos de evaluación crediticia.
LIBRERIAS¶
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)
seed=123
Diccionario de datos¶
ruta = '../data_practica_EDA/columns_description.xlsx'
dictionary = pd.read_excel(ruta, index_col=0)
dictionary
| Table | Row | Description | Special | |
|---|---|---|---|---|
| 1 | application_data | SK_ID_CURR | ID of loan in our sample | NaN |
| 2 | application_data | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 5 | application_data | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 6 | application_data | CODE_GENDER | Gender of the client | NaN |
| 7 | application_data | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
| 8 | application_data | FLAG_OWN_REALTY | Flag if client owns a house or flat | NaN |
| 9 | application_data | CNT_CHILDREN | Number of children the client has | NaN |
| 10 | application_data | AMT_INCOME_TOTAL | Income of the client | NaN |
| 11 | application_data | AMT_CREDIT | Credit amount of the loan | NaN |
| 12 | application_data | AMT_ANNUITY | Loan annuity | NaN |
| 13 | application_data | AMT_GOODS_PRICE | For consumer loans it is the price of the good... | NaN |
| 14 | application_data | NAME_TYPE_SUITE | Who was accompanying client when he was applyi... | NaN |
| 15 | application_data | NAME_INCOME_TYPE | Clients income type (businessman, working, mat... | NaN |
| 16 | application_data | NAME_EDUCATION_TYPE | Level of highest education the client achieved | NaN |
| 17 | application_data | NAME_FAMILY_STATUS | Family status of the client | NaN |
| 18 | application_data | NAME_HOUSING_TYPE | What is the housing situation of the client (r... | NaN |
| 19 | application_data | REGION_POPULATION_RELATIVE | Normalized population of region where client l... | normalized |
| 20 | application_data | DAYS_BIRTH | Client's age in days at the time of application | time only relative to the application |
| 21 | application_data | DAYS_EMPLOYED | How many days before the application the perso... | time only relative to the application |
| 22 | application_data | DAYS_REGISTRATION | How many days before the application did clien... | time only relative to the application |
| 23 | application_data | DAYS_ID_PUBLISH | How many days before the application did clien... | time only relative to the application |
| 24 | application_data | OWN_CAR_AGE | Age of client's car | NaN |
| 25 | application_data | FLAG_MOBIL | Did client provide mobile phone (1=YES, 0=NO) | NaN |
| 26 | application_data | FLAG_EMP_PHONE | Did client provide work phone (1=YES, 0=NO) | NaN |
| 27 | application_data | FLAG_WORK_PHONE | Did client provide home phone (1=YES, 0=NO) | NaN |
| 28 | application_data | FLAG_CONT_MOBILE | Was mobile phone reachable (1=YES, 0=NO) | NaN |
| 29 | application_data | FLAG_PHONE | Did client provide home phone (1=YES, 0=NO) | NaN |
| 30 | application_data | FLAG_EMAIL | Did client provide email (1=YES, 0=NO) | NaN |
| 31 | application_data | OCCUPATION_TYPE | What kind of occupation does the client have | NaN |
| 32 | application_data | CNT_FAM_MEMBERS | How many family members does client have | NaN |
| 33 | application_data | REGION_RATING_CLIENT | Our rating of the region where client lives (1... | NaN |
| 34 | application_data | REGION_RATING_CLIENT_W_CITY | Our rating of the region where client lives wi... | NaN |
| 35 | application_data | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply ... | NaN |
| 36 | application_data | HOUR_APPR_PROCESS_START | Approximately at what hour did the client appl... | rounded |
| 37 | application_data | REG_REGION_NOT_LIVE_REGION | Flag if client's permanent address does not ma... | NaN |
| 38 | application_data | REG_REGION_NOT_WORK_REGION | Flag if client's permanent address does not ma... | NaN |
| 39 | application_data | LIVE_REGION_NOT_WORK_REGION | Flag if client's contact address does not matc... | NaN |
| 40 | application_data | REG_CITY_NOT_LIVE_CITY | Flag if client's permanent address does not ma... | NaN |
| 41 | application_data | REG_CITY_NOT_WORK_CITY | Flag if client's permanent address does not ma... | NaN |
| 42 | application_data | LIVE_CITY_NOT_WORK_CITY | Flag if client's contact address does not matc... | NaN |
| 43 | application_data | ORGANIZATION_TYPE | Type of organization where client works | NaN |
| 44 | application_data | EXT_SOURCE_1 | Normalized score from external data source | normalized |
| 45 | application_data | EXT_SOURCE_2 | Normalized score from external data source | normalized |
| 46 | application_data | EXT_SOURCE_3 | Normalized score from external data source | normalized |
| 47 | application_data | APARTMENTS_AVG | Normalized information about building where th... | normalized |
| 48 | application_data | BASEMENTAREA_AVG | Normalized information about building where th... | normalized |
| 49 | application_data | YEARS_BEGINEXPLUATATION_AVG | Normalized information about building where th... | normalized |
| 50 | application_data | YEARS_BUILD_AVG | Normalized information about building where th... | normalized |
| 51 | application_data | COMMONAREA_AVG | Normalized information about building where th... | normalized |
| 52 | application_data | ELEVATORS_AVG | Normalized information about building where th... | normalized |
| 53 | application_data | ENTRANCES_AVG | Normalized information about building where th... | normalized |
| 54 | application_data | FLOORSMAX_AVG | Normalized information about building where th... | normalized |
| 55 | application_data | FLOORSMIN_AVG | Normalized information about building where th... | normalized |
| 56 | application_data | LANDAREA_AVG | Normalized information about building where th... | normalized |
| 57 | application_data | LIVINGAPARTMENTS_AVG | Normalized information about building where th... | normalized |
| 58 | application_data | LIVINGAREA_AVG | Normalized information about building where th... | normalized |
| 59 | application_data | NONLIVINGAPARTMENTS_AVG | Normalized information about building where th... | normalized |
| 60 | application_data | NONLIVINGAREA_AVG | Normalized information about building where th... | normalized |
| 61 | application_data | APARTMENTS_MODE | Normalized information about building where th... | normalized |
| 62 | application_data | BASEMENTAREA_MODE | Normalized information about building where th... | normalized |
| 63 | application_data | YEARS_BEGINEXPLUATATION_MODE | Normalized information about building where th... | normalized |
| 64 | application_data | YEARS_BUILD_MODE | Normalized information about building where th... | normalized |
| 65 | application_data | COMMONAREA_MODE | Normalized information about building where th... | normalized |
| 66 | application_data | ELEVATORS_MODE | Normalized information about building where th... | normalized |
| 67 | application_data | ENTRANCES_MODE | Normalized information about building where th... | normalized |
| 68 | application_data | FLOORSMAX_MODE | Normalized information about building where th... | normalized |
| 69 | application_data | FLOORSMIN_MODE | Normalized information about building where th... | normalized |
| 70 | application_data | LANDAREA_MODE | Normalized information about building where th... | normalized |
| 71 | application_data | LIVINGAPARTMENTS_MODE | Normalized information about building where th... | normalized |
| 72 | application_data | LIVINGAREA_MODE | Normalized information about building where th... | normalized |
| 73 | application_data | NONLIVINGAPARTMENTS_MODE | Normalized information about building where th... | normalized |
| 74 | application_data | NONLIVINGAREA_MODE | Normalized information about building where th... | normalized |
| 75 | application_data | APARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 76 | application_data | BASEMENTAREA_MEDI | Normalized information about building where th... | normalized |
| 77 | application_data | YEARS_BEGINEXPLUATATION_MEDI | Normalized information about building where th... | normalized |
| 78 | application_data | YEARS_BUILD_MEDI | Normalized information about building where th... | normalized |
| 79 | application_data | COMMONAREA_MEDI | Normalized information about building where th... | normalized |
| 80 | application_data | ELEVATORS_MEDI | Normalized information about building where th... | normalized |
| 81 | application_data | ENTRANCES_MEDI | Normalized information about building where th... | normalized |
| 82 | application_data | FLOORSMAX_MEDI | Normalized information about building where th... | normalized |
| 83 | application_data | FLOORSMIN_MEDI | Normalized information about building where th... | normalized |
| 84 | application_data | LANDAREA_MEDI | Normalized information about building where th... | normalized |
| 85 | application_data | LIVINGAPARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 86 | application_data | LIVINGAREA_MEDI | Normalized information about building where th... | normalized |
| 87 | application_data | NONLIVINGAPARTMENTS_MEDI | Normalized information about building where th... | normalized |
| 88 | application_data | NONLIVINGAREA_MEDI | Normalized information about building where th... | normalized |
| 89 | application_data | FONDKAPREMONT_MODE | Normalized information about building where th... | normalized |
| 90 | application_data | HOUSETYPE_MODE | Normalized information about building where th... | normalized |
| 91 | application_data | TOTALAREA_MODE | Normalized information about building where th... | normalized |
| 92 | application_data | WALLSMATERIAL_MODE | Normalized information about building where th... | normalized |
| 93 | application_data | EMERGENCYSTATE_MODE | Normalized information about building where th... | normalized |
| 94 | application_data | OBS_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 95 | application_data | DEF_30_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 96 | application_data | OBS_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 97 | application_data | DEF_60_CNT_SOCIAL_CIRCLE | How many observation of client's social surrou... | NaN |
| 98 | application_data | DAYS_LAST_PHONE_CHANGE | How many days before application did client ch... | NaN |
| 99 | application_data | FLAG_DOCUMENT_2 | Did client provide document 2 | NaN |
| 100 | application_data | FLAG_DOCUMENT_3 | Did client provide document 3 | NaN |
| 101 | application_data | FLAG_DOCUMENT_4 | Did client provide document 4 | NaN |
| 102 | application_data | FLAG_DOCUMENT_5 | Did client provide document 5 | NaN |
| 103 | application_data | FLAG_DOCUMENT_6 | Did client provide document 6 | NaN |
| 104 | application_data | FLAG_DOCUMENT_7 | Did client provide document 7 | NaN |
| 105 | application_data | FLAG_DOCUMENT_8 | Did client provide document 8 | NaN |
| 106 | application_data | FLAG_DOCUMENT_9 | Did client provide document 9 | NaN |
| 107 | application_data | FLAG_DOCUMENT_10 | Did client provide document 10 | NaN |
| 108 | application_data | FLAG_DOCUMENT_11 | Did client provide document 11 | NaN |
| 109 | application_data | FLAG_DOCUMENT_12 | Did client provide document 12 | NaN |
| 110 | application_data | FLAG_DOCUMENT_13 | Did client provide document 13 | NaN |
| 111 | application_data | FLAG_DOCUMENT_14 | Did client provide document 14 | NaN |
| 112 | application_data | FLAG_DOCUMENT_15 | Did client provide document 15 | NaN |
| 113 | application_data | FLAG_DOCUMENT_16 | Did client provide document 16 | NaN |
| 114 | application_data | FLAG_DOCUMENT_17 | Did client provide document 17 | NaN |
| 115 | application_data | FLAG_DOCUMENT_18 | Did client provide document 18 | NaN |
| 116 | application_data | FLAG_DOCUMENT_19 | Did client provide document 19 | NaN |
| 117 | application_data | FLAG_DOCUMENT_20 | Did client provide document 20 | NaN |
| 118 | application_data | FLAG_DOCUMENT_21 | Did client provide document 21 | NaN |
| 119 | application_data | AMT_REQ_CREDIT_BUREAU_HOUR | Number of enquiries to Credit Bureau about the... | NaN |
| 120 | application_data | AMT_REQ_CREDIT_BUREAU_DAY | Number of enquiries to Credit Bureau about the... | NaN |
| 121 | application_data | AMT_REQ_CREDIT_BUREAU_WEEK | Number of enquiries to Credit Bureau about the... | NaN |
| 122 | application_data | AMT_REQ_CREDIT_BUREAU_MON | Number of enquiries to Credit Bureau about the... | NaN |
| 123 | application_data | AMT_REQ_CREDIT_BUREAU_QRT | Number of enquiries to Credit Bureau about the... | NaN |
| 124 | application_data | AMT_REQ_CREDIT_BUREAU_YEAR | Number of enquiries to Credit Bureau about the... | NaN |
| 176 | previous_application.csv | SK_ID_PREV | ID of previous credit in Home credit related t... | hashed |
| 177 | previous_application.csv | SK_ID_CURR | ID of loan in our sample | hashed |
| 178 | previous_application.csv | NAME_CONTRACT_TYPE | Contract product type (Cash loan, consumer loa... | NaN |
| 179 | previous_application.csv | AMT_ANNUITY | Annuity of previous application | NaN |
| 180 | previous_application.csv | AMT_APPLICATION | For how much credit did client ask on the prev... | NaN |
| 181 | previous_application.csv | AMT_CREDIT | Final credit amount on the previous applicatio... | NaN |
| 182 | previous_application.csv | AMT_DOWN_PAYMENT | Down payment on the previous application | NaN |
| 183 | previous_application.csv | AMT_GOODS_PRICE | Goods price of good that client asked for (if ... | NaN |
| 184 | previous_application.csv | WEEKDAY_APPR_PROCESS_START | On which day of the week did the client apply ... | NaN |
| 185 | previous_application.csv | HOUR_APPR_PROCESS_START | Approximately at what day hour did the client ... | rounded |
| 186 | previous_application.csv | FLAG_LAST_APPL_PER_CONTRACT | Flag if it was last application for the previo... | NaN |
| 187 | previous_application.csv | NFLAG_LAST_APPL_IN_DAY | Flag if the application was the last applicati... | NaN |
| 188 | previous_application.csv | NFLAG_MICRO_CASH | Flag Micro finance loan | NaN |
| 189 | previous_application.csv | RATE_DOWN_PAYMENT | Down payment rate normalized on previous credit | normalized |
| 190 | previous_application.csv | RATE_INTEREST_PRIMARY | Interest rate normalized on previous credit | normalized |
| 191 | previous_application.csv | RATE_INTEREST_PRIVILEGED | Interest rate normalized on previous credit | normalized |
| 192 | previous_application.csv | NAME_CASH_LOAN_PURPOSE | Purpose of the cash loan | NaN |
| 193 | previous_application.csv | NAME_CONTRACT_STATUS | Contract status (approved, cancelled, ...) of ... | NaN |
| 194 | previous_application.csv | DAYS_DECISION | Relative to current application when was the d... | time only relative to the application |
| 195 | previous_application.csv | NAME_PAYMENT_TYPE | Payment method that client chose to pay for th... | NaN |
| 196 | previous_application.csv | CODE_REJECT_REASON | Why was the previous application rejected | NaN |
| 197 | previous_application.csv | NAME_TYPE_SUITE | Who accompanied client when applying for the p... | NaN |
| 198 | previous_application.csv | NAME_CLIENT_TYPE | Was the client old or new client when applying... | NaN |
| 199 | previous_application.csv | NAME_GOODS_CATEGORY | What kind of goods did the client apply for in... | NaN |
| 200 | previous_application.csv | NAME_PORTFOLIO | Was the previous application for CASH, POS, CA... | NaN |
| 201 | previous_application.csv | NAME_PRODUCT_TYPE | Was the previous application x-sell o walk-in | NaN |
| 202 | previous_application.csv | CHANNEL_TYPE | Through which channel we acquired the client o... | NaN |
| 203 | previous_application.csv | SELLERPLACE_AREA | Selling area of seller place of the previous a... | NaN |
| 204 | previous_application.csv | NAME_SELLER_INDUSTRY | The industry of the seller | NaN |
| 205 | previous_application.csv | CNT_PAYMENT | Term of previous credit at application of the ... | NaN |
| 206 | previous_application.csv | NAME_YIELD_GROUP | Grouped interest rate into small medium and hi... | grouped |
| 207 | previous_application.csv | PRODUCT_COMBINATION | Detailed product combination of the previous a... | NaN |
| 208 | previous_application.csv | DAYS_FIRST_DRAWING | Relative to application date of current applic... | time only relative to the application |
| 209 | previous_application.csv | DAYS_FIRST_DUE | Relative to application date of current applic... | time only relative to the application |
| 210 | previous_application.csv | DAYS_LAST_DUE_1ST_VERSION | Relative to application date of current applic... | time only relative to the application |
| 211 | previous_application.csv | DAYS_LAST_DUE | Relative to application date of current applic... | time only relative to the application |
| 212 | previous_application.csv | DAYS_TERMINATION | Relative to application date of current applic... | time only relative to the application |
| 213 | previous_application.csv | NFLAG_INSURED_ON_APPROVAL | Did the client requested insurance during the ... | NaN |
FUNCIONES¶
A partir del segundo notebook, todas las funciones que se emplearán estarán importadas desde el archivo "funciones_auxiliares".
def dame_variables_categoricas(dataset=None, max_unicos=100):
""" Se indica el valor de valores únicos máximo(max_unicos) como un parametro de manera que si lo escribimos a lo largo de la función
más de una vez solo se tenga que cambiar en una línea.
Inputs:
-- dataset: pandas dataframe que contiene los datos
-- max_unicos: máximo de unicos que se desean por columna(variable)
"""
if dataset is None:
print(u'\nFaltan argumentos por pasar a la función')
return 1
lista_variables_categoricas = []
other = []
for i in dataset.columns:
unicos = int(len(np.unique(dataset[i].dropna(axis=0, how='all'))))
if dataset[i].dtype in ['object', 'category'] or unicos < max_unicos: # A traves de la lista simplificamos el código.
lista_variables_categoricas.append(i)
else:
other.append(i)
return lista_variables_categoricas, other
pd_loan = pd.read_csv('../data_practica_EDA/application_data.csv').set_index('SK_ID_CURR')
pd_loan
| TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.0220 | 0.0198 | 0.0 | 0.0000 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.0790 | 0.0554 | 0.0 | 0.0000 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.0100 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | 225000.0 | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | 0.032561 | -9327 | -236 | -8456.0 | -1982 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Sales staff | 1.0 | 1 | 1 | THURSDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | Services | 0.145570 | 0.681632 | NaN | 0.2021 | 0.0887 | 0.9876 | 0.8300 | 0.0202 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0594 | 0.1484 | 0.1965 | 0.0753 | 0.1095 | 0.1008 | 0.0172 | 0.9782 | 0.7125 | 0.0172 | 0.0806 | 0.0345 | 0.4583 | 0.0417 | 0.0094 | 0.0882 | 0.0853 | 0.0 | 0.0125 | 0.2040 | 0.0887 | 0.9876 | 0.8323 | 0.0203 | 0.22 | 0.1034 | 0.6042 | 0.2708 | 0.0605 | 0.1509 | 0.2001 | 0.0757 | 0.1118 | reg oper account | block of flats | 0.2898 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -273.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | 225000.0 | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | 0.025164 | -20775 | 365243 | -4388.0 | -4090 | NaN | 1 | 0 | 0 | 1 | 1 | 0 | NaN | 1.0 | 2 | 2 | MONDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | NaN | 0.115992 | NaN | 0.0247 | 0.0435 | 0.9727 | 0.6260 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0579 | 0.0202 | 0.0257 | 0.0000 | 0.0000 | 0.0252 | 0.0451 | 0.9727 | 0.6406 | 0.0022 | 0.0000 | 0.1034 | 0.0833 | 0.1250 | 0.0592 | 0.0220 | 0.0267 | 0.0 | 0.0000 | 0.0250 | 0.0435 | 0.9727 | 0.6310 | 0.0022 | 0.00 | 0.1034 | 0.0833 | 0.1250 | 0.0589 | 0.0205 | 0.0261 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.0214 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | 585000.0 | Unaccompanied | Working | Higher education | Separated | House / apartment | 0.005002 | -14966 | -7921 | -6737.0 | -5150 | NaN | 1 | 1 | 0 | 1 | 0 | 1 | Managers | 1.0 | 3 | 3 | THURSDAY | 9 | 0 | 0 | 0 | 0 | 1 | 1 | School | 0.744026 | 0.535722 | 0.218859 | 0.1031 | 0.0862 | 0.9816 | 0.7484 | 0.0123 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0841 | 0.9279 | 0.0000 | 0.0000 | 0.1050 | 0.0894 | 0.9816 | 0.7583 | 0.0124 | 0.0000 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0918 | 0.9667 | 0.0 | 0.0000 | 0.1041 | 0.0862 | 0.9816 | 0.7518 | 0.0124 | 0.00 | 0.2069 | 0.1667 | 0.2083 | NaN | 0.0855 | 0.9445 | 0.0000 | 0.0000 | reg oper account | block of flats | 0.7970 | Panel | No | 6.0 | 0.0 | 6.0 | 0.0 | -1909.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | 319500.0 | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | 0.005313 | -11961 | -4786 | -2562.0 | -931 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 1 | 1 | 0 | Business Entity Type 1 | NaN | 0.514163 | 0.661024 | 0.0124 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0061 | NaN | NaN | 0.0126 | NaN | 0.9772 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0063 | NaN | NaN | 0.0125 | NaN | 0.9771 | NaN | NaN | NaN | 0.0690 | 0.0417 | NaN | NaN | NaN | 0.0062 | NaN | NaN | NaN | block of flats | 0.0086 | Stone, brick | No | 0.0 | 0.0 | 0.0 | 0.0 | -322.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | 675000.0 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.046220 | -16856 | -1262 | -5128.0 | -410 | NaN | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2.0 | 1 | 1 | THURSDAY | 20 | 0 | 0 | 0 | 0 | 1 | 1 | Business Entity Type 3 | 0.734460 | 0.708569 | 0.113922 | 0.0742 | 0.0526 | 0.9881 | NaN | 0.0176 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0791 | NaN | 0.0000 | 0.0756 | 0.0546 | 0.9881 | NaN | 0.0178 | 0.0806 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0824 | NaN | 0.0000 | 0.0749 | 0.0526 | 0.9881 | NaN | 0.0177 | 0.08 | 0.0690 | 0.3750 | NaN | NaN | NaN | 0.0805 | NaN | 0.0000 | NaN | block of flats | 0.0718 | Panel | No | 0.0 | 0.0 | 0.0 | 0.0 | -787.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 121 columns
Dimensión del dataset
print(pd_loan.shape, pd_loan.drop_duplicates().shape)
(307511, 121) (307511, 121)
Tras la ejecución de este código en el que imprimimos la dimensión (columnas y filas) del dataset antes y después de la comprobación de duplicados, se observa que no se ha eliminado ningún registro, lo que indica que no hay duplicados. El dataframe está compuesto de 307.511 filas y 121 columnas.
pd_loan.dtypes.to_dict()
{'TARGET': dtype('int64'),
'NAME_CONTRACT_TYPE': dtype('O'),
'CODE_GENDER': dtype('O'),
'FLAG_OWN_CAR': dtype('O'),
'FLAG_OWN_REALTY': dtype('O'),
'CNT_CHILDREN': dtype('int64'),
'AMT_INCOME_TOTAL': dtype('float64'),
'AMT_CREDIT': dtype('float64'),
'AMT_ANNUITY': dtype('float64'),
'AMT_GOODS_PRICE': dtype('float64'),
'NAME_TYPE_SUITE': dtype('O'),
'NAME_INCOME_TYPE': dtype('O'),
'NAME_EDUCATION_TYPE': dtype('O'),
'NAME_FAMILY_STATUS': dtype('O'),
'NAME_HOUSING_TYPE': dtype('O'),
'REGION_POPULATION_RELATIVE': dtype('float64'),
'DAYS_BIRTH': dtype('int64'),
'DAYS_EMPLOYED': dtype('int64'),
'DAYS_REGISTRATION': dtype('float64'),
'DAYS_ID_PUBLISH': dtype('int64'),
'OWN_CAR_AGE': dtype('float64'),
'FLAG_MOBIL': dtype('int64'),
'FLAG_EMP_PHONE': dtype('int64'),
'FLAG_WORK_PHONE': dtype('int64'),
'FLAG_CONT_MOBILE': dtype('int64'),
'FLAG_PHONE': dtype('int64'),
'FLAG_EMAIL': dtype('int64'),
'OCCUPATION_TYPE': dtype('O'),
'CNT_FAM_MEMBERS': dtype('float64'),
'REGION_RATING_CLIENT': dtype('int64'),
'REGION_RATING_CLIENT_W_CITY': dtype('int64'),
'WEEKDAY_APPR_PROCESS_START': dtype('O'),
'HOUR_APPR_PROCESS_START': dtype('int64'),
'REG_REGION_NOT_LIVE_REGION': dtype('int64'),
'REG_REGION_NOT_WORK_REGION': dtype('int64'),
'LIVE_REGION_NOT_WORK_REGION': dtype('int64'),
'REG_CITY_NOT_LIVE_CITY': dtype('int64'),
'REG_CITY_NOT_WORK_CITY': dtype('int64'),
'LIVE_CITY_NOT_WORK_CITY': dtype('int64'),
'ORGANIZATION_TYPE': dtype('O'),
'EXT_SOURCE_1': dtype('float64'),
'EXT_SOURCE_2': dtype('float64'),
'EXT_SOURCE_3': dtype('float64'),
'APARTMENTS_AVG': dtype('float64'),
'BASEMENTAREA_AVG': dtype('float64'),
'YEARS_BEGINEXPLUATATION_AVG': dtype('float64'),
'YEARS_BUILD_AVG': dtype('float64'),
'COMMONAREA_AVG': dtype('float64'),
'ELEVATORS_AVG': dtype('float64'),
'ENTRANCES_AVG': dtype('float64'),
'FLOORSMAX_AVG': dtype('float64'),
'FLOORSMIN_AVG': dtype('float64'),
'LANDAREA_AVG': dtype('float64'),
'LIVINGAPARTMENTS_AVG': dtype('float64'),
'LIVINGAREA_AVG': dtype('float64'),
'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
'NONLIVINGAREA_AVG': dtype('float64'),
'APARTMENTS_MODE': dtype('float64'),
'BASEMENTAREA_MODE': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MODE': dtype('float64'),
'YEARS_BUILD_MODE': dtype('float64'),
'COMMONAREA_MODE': dtype('float64'),
'ELEVATORS_MODE': dtype('float64'),
'ENTRANCES_MODE': dtype('float64'),
'FLOORSMAX_MODE': dtype('float64'),
'FLOORSMIN_MODE': dtype('float64'),
'LANDAREA_MODE': dtype('float64'),
'LIVINGAPARTMENTS_MODE': dtype('float64'),
'LIVINGAREA_MODE': dtype('float64'),
'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
'NONLIVINGAREA_MODE': dtype('float64'),
'APARTMENTS_MEDI': dtype('float64'),
'BASEMENTAREA_MEDI': dtype('float64'),
'YEARS_BEGINEXPLUATATION_MEDI': dtype('float64'),
'YEARS_BUILD_MEDI': dtype('float64'),
'COMMONAREA_MEDI': dtype('float64'),
'ELEVATORS_MEDI': dtype('float64'),
'ENTRANCES_MEDI': dtype('float64'),
'FLOORSMAX_MEDI': dtype('float64'),
'FLOORSMIN_MEDI': dtype('float64'),
'LANDAREA_MEDI': dtype('float64'),
'LIVINGAPARTMENTS_MEDI': dtype('float64'),
'LIVINGAREA_MEDI': dtype('float64'),
'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
'NONLIVINGAREA_MEDI': dtype('float64'),
'FONDKAPREMONT_MODE': dtype('O'),
'HOUSETYPE_MODE': dtype('O'),
'TOTALAREA_MODE': dtype('float64'),
'WALLSMATERIAL_MODE': dtype('O'),
'EMERGENCYSTATE_MODE': dtype('O'),
'OBS_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
'OBS_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DEF_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
'DAYS_LAST_PHONE_CHANGE': dtype('float64'),
'FLAG_DOCUMENT_2': dtype('int64'),
'FLAG_DOCUMENT_3': dtype('int64'),
'FLAG_DOCUMENT_4': dtype('int64'),
'FLAG_DOCUMENT_5': dtype('int64'),
'FLAG_DOCUMENT_6': dtype('int64'),
'FLAG_DOCUMENT_7': dtype('int64'),
'FLAG_DOCUMENT_8': dtype('int64'),
'FLAG_DOCUMENT_9': dtype('int64'),
'FLAG_DOCUMENT_10': dtype('int64'),
'FLAG_DOCUMENT_11': dtype('int64'),
'FLAG_DOCUMENT_12': dtype('int64'),
'FLAG_DOCUMENT_13': dtype('int64'),
'FLAG_DOCUMENT_14': dtype('int64'),
'FLAG_DOCUMENT_15': dtype('int64'),
'FLAG_DOCUMENT_16': dtype('int64'),
'FLAG_DOCUMENT_17': dtype('int64'),
'FLAG_DOCUMENT_18': dtype('int64'),
'FLAG_DOCUMENT_19': dtype('int64'),
'FLAG_DOCUMENT_20': dtype('int64'),
'FLAG_DOCUMENT_21': dtype('int64'),
'AMT_REQ_CREDIT_BUREAU_HOUR': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_DAY': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_WEEK': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_MON': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_QRT': dtype('float64'),
'AMT_REQ_CREDIT_BUREAU_YEAR': dtype('float64')}
Se observa cada columna asociada a una variable y el tipo de variable que es, object o numéricas.
Exploración de la variable objetivo¶
pd_plot_target = pd_loan['TARGET']\
.value_counts(normalize=True)\
.mul(100).rename('percent').reset_index()
pd_plot_target_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_target_pc = pd.merge(pd_plot_target, pd_plot_target_conteo, on=['TARGET'], how='inner')
pd_plot_target_pc
| TARGET | percent | count | |
|---|---|---|---|
| 0 | 0 | 91.927118 | 282686 |
| 1 | 1 | 8.072882 | 24825 |
import plotly.io as pio
pio.renderers.default = 'notebook' # Tengo problemas para visualizar el gráfico en html pero con esta linea se arregla
fig = px.bar(pd_plot_target_pc, x="TARGET", y='percent', text='percent', labels={"TARGET": "Estado de Pago (TARGET)", "percent": "Porcentaje"})
fig.show()
La variable 'TARGET' nos muestra si se ha producido retraso en el pago de las cuotas del préstamo. El 91.93% de los clientes han pagado las cuotas del prestamo sin retraso mientras que el 8.07% han tenido dificultades para pagar. En valores absolutos son 282686 quienes pagaron a tiempo frente a 24825 personas que se retrasaron en el pago. Se observa que la variable está desbalanceada.
Selección de threshold (umbral) por filas y columnas para suprimir valores missing¶
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)
pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])
pd_null_filas['target'] = pd_loan['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_loan.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_loan.shape[1]
(121,) (307511,)
pd_loan.shape
(307511, 121)
pd_null_columnas
| nulos_columnas | porcentaje_columnas | |
|---|---|---|
| COMMONAREA_AVG | 214865 | 0.698723 |
| COMMONAREA_MODE | 214865 | 0.698723 |
| COMMONAREA_MEDI | 214865 | 0.698723 |
| NONLIVINGAPARTMENTS_AVG | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_MODE | 213514 | 0.694330 |
| NONLIVINGAPARTMENTS_MEDI | 213514 | 0.694330 |
| FONDKAPREMONT_MODE | 210295 | 0.683862 |
| LIVINGAPARTMENTS_MEDI | 210199 | 0.683550 |
| LIVINGAPARTMENTS_AVG | 210199 | 0.683550 |
| LIVINGAPARTMENTS_MODE | 210199 | 0.683550 |
| FLOORSMIN_AVG | 208642 | 0.678486 |
| FLOORSMIN_MODE | 208642 | 0.678486 |
| FLOORSMIN_MEDI | 208642 | 0.678486 |
| YEARS_BUILD_AVG | 204488 | 0.664978 |
| YEARS_BUILD_MEDI | 204488 | 0.664978 |
| YEARS_BUILD_MODE | 204488 | 0.664978 |
| OWN_CAR_AGE | 202929 | 0.659908 |
| LANDAREA_MEDI | 182590 | 0.593767 |
| LANDAREA_AVG | 182590 | 0.593767 |
| LANDAREA_MODE | 182590 | 0.593767 |
| BASEMENTAREA_MEDI | 179943 | 0.585160 |
| BASEMENTAREA_AVG | 179943 | 0.585160 |
| BASEMENTAREA_MODE | 179943 | 0.585160 |
| EXT_SOURCE_1 | 173378 | 0.563811 |
| NONLIVINGAREA_AVG | 169682 | 0.551792 |
| NONLIVINGAREA_MODE | 169682 | 0.551792 |
| NONLIVINGAREA_MEDI | 169682 | 0.551792 |
| ELEVATORS_MODE | 163891 | 0.532960 |
| ELEVATORS_AVG | 163891 | 0.532960 |
| ELEVATORS_MEDI | 163891 | 0.532960 |
| WALLSMATERIAL_MODE | 156341 | 0.508408 |
| APARTMENTS_AVG | 156061 | 0.507497 |
| APARTMENTS_MEDI | 156061 | 0.507497 |
| APARTMENTS_MODE | 156061 | 0.507497 |
| ENTRANCES_AVG | 154828 | 0.503488 |
| ENTRANCES_MEDI | 154828 | 0.503488 |
| ENTRANCES_MODE | 154828 | 0.503488 |
| LIVINGAREA_MEDI | 154350 | 0.501933 |
| LIVINGAREA_MODE | 154350 | 0.501933 |
| LIVINGAREA_AVG | 154350 | 0.501933 |
| HOUSETYPE_MODE | 154297 | 0.501761 |
| FLOORSMAX_MEDI | 153020 | 0.497608 |
| FLOORSMAX_MODE | 153020 | 0.497608 |
| FLOORSMAX_AVG | 153020 | 0.497608 |
| YEARS_BEGINEXPLUATATION_MEDI | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_MODE | 150007 | 0.487810 |
| YEARS_BEGINEXPLUATATION_AVG | 150007 | 0.487810 |
| TOTALAREA_MODE | 148431 | 0.482685 |
| EMERGENCYSTATE_MODE | 145755 | 0.473983 |
| OCCUPATION_TYPE | 96391 | 0.313455 |
| EXT_SOURCE_3 | 60965 | 0.198253 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_MON | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_QRT | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_DAY | 41519 | 0.135016 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 41519 | 0.135016 |
| NAME_TYPE_SUITE | 1292 | 0.004201 |
| DEF_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| OBS_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| OBS_30_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| DEF_60_CNT_SOCIAL_CIRCLE | 1021 | 0.003320 |
| EXT_SOURCE_2 | 660 | 0.002146 |
| AMT_GOODS_PRICE | 278 | 0.000904 |
| AMT_ANNUITY | 12 | 0.000039 |
| CNT_FAM_MEMBERS | 2 | 0.000007 |
| DAYS_LAST_PHONE_CHANGE | 1 | 0.000003 |
| AMT_INCOME_TOTAL | 0 | 0.000000 |
| FLAG_DOCUMENT_8 | 0 | 0.000000 |
| CODE_GENDER | 0 | 0.000000 |
| FLAG_OWN_CAR | 0 | 0.000000 |
| FLAG_OWN_REALTY | 0 | 0.000000 |
| FLAG_DOCUMENT_2 | 0 | 0.000000 |
| FLAG_DOCUMENT_3 | 0 | 0.000000 |
| FLAG_DOCUMENT_4 | 0 | 0.000000 |
| FLAG_DOCUMENT_5 | 0 | 0.000000 |
| FLAG_DOCUMENT_6 | 0 | 0.000000 |
| FLAG_DOCUMENT_7 | 0 | 0.000000 |
| FLAG_DOCUMENT_9 | 0 | 0.000000 |
| FLAG_DOCUMENT_21 | 0 | 0.000000 |
| FLAG_DOCUMENT_10 | 0 | 0.000000 |
| FLAG_DOCUMENT_11 | 0 | 0.000000 |
| CNT_CHILDREN | 0 | 0.000000 |
| FLAG_DOCUMENT_13 | 0 | 0.000000 |
| FLAG_DOCUMENT_14 | 0 | 0.000000 |
| FLAG_DOCUMENT_15 | 0 | 0.000000 |
| FLAG_DOCUMENT_16 | 0 | 0.000000 |
| FLAG_DOCUMENT_17 | 0 | 0.000000 |
| FLAG_DOCUMENT_18 | 0 | 0.000000 |
| FLAG_DOCUMENT_19 | 0 | 0.000000 |
| FLAG_DOCUMENT_20 | 0 | 0.000000 |
| FLAG_DOCUMENT_12 | 0 | 0.000000 |
| AMT_CREDIT | 0 | 0.000000 |
| ORGANIZATION_TYPE | 0 | 0.000000 |
| NAME_INCOME_TYPE | 0 | 0.000000 |
| LIVE_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| NAME_CONTRACT_TYPE | 0 | 0.000000 |
| REG_CITY_NOT_WORK_CITY | 0 | 0.000000 |
| REG_CITY_NOT_LIVE_CITY | 0 | 0.000000 |
| LIVE_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_WORK_REGION | 0 | 0.000000 |
| REG_REGION_NOT_LIVE_REGION | 0 | 0.000000 |
| HOUR_APPR_PROCESS_START | 0 | 0.000000 |
| WEEKDAY_APPR_PROCESS_START | 0 | 0.000000 |
| REGION_RATING_CLIENT_W_CITY | 0 | 0.000000 |
| REGION_RATING_CLIENT | 0 | 0.000000 |
| FLAG_EMAIL | 0 | 0.000000 |
| FLAG_PHONE | 0 | 0.000000 |
| FLAG_CONT_MOBILE | 0 | 0.000000 |
| FLAG_WORK_PHONE | 0 | 0.000000 |
| FLAG_EMP_PHONE | 0 | 0.000000 |
| FLAG_MOBIL | 0 | 0.000000 |
| DAYS_ID_PUBLISH | 0 | 0.000000 |
| DAYS_REGISTRATION | 0 | 0.000000 |
| DAYS_EMPLOYED | 0 | 0.000000 |
| DAYS_BIRTH | 0 | 0.000000 |
| REGION_POPULATION_RELATIVE | 0 | 0.000000 |
| NAME_HOUSING_TYPE | 0 | 0.000000 |
| NAME_FAMILY_STATUS | 0 | 0.000000 |
| NAME_EDUCATION_TYPE | 0 | 0.000000 |
| TARGET | 0 | 0.000000 |
threshold=0.80
list_vars_not_null = list(pd_null_columnas[pd_null_columnas['porcentaje_columnas']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape
(307511, 121)
Se utiliza un umbral de 0.85 para eliminar columnas con más del 85% de valores nulos almacenando la totalidad de las columnas.
Aunque un umbral menor (como 0.6) podría ser más estricto, se decidió conservar más columnas, ya que las proporciones de nulos no son consideradas excesivamente altas.
pd_null_filas
| nulos_filas | target | porcentaje_filas | |
|---|---|---|---|
| SK_ID_CURR | |||
| 315294 | 61 | 0 | 0.504132 |
| 255145 | 61 | 0 | 0.504132 |
| 329262 | 61 | 0 | 0.504132 |
| 235599 | 61 | 0 | 0.504132 |
| 412312 | 61 | 0 | 0.504132 |
| ... | ... | ... | ... |
| 250717 | 0 | 0 | 0.000000 |
| 250702 | 0 | 0 | 0.000000 |
| 250697 | 0 | 0 | 0.000000 |
| 250680 | 0 | 0 | 0.000000 |
| 278202 | 0 | 0 | 0.000000 |
307511 rows × 3 columns
La fila con mayor porcentaje de valores missing obtiene un 50% compartiendo valores con varias filas más.
list_cat_vars, other = dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_cat_vars] = pd_loan_filter_null[list_cat_vars].astype("category")
pd_loan_filter_null[list_cat_vars].head()
| FONDKAPREMONT_MODE | FLOORSMIN_MODE | FLOORSMIN_MEDI | OWN_CAR_AGE | ELEVATORS_MODE | ELEVATORS_MEDI | WALLSMATERIAL_MODE | ENTRANCES_MEDI | ENTRANCES_MODE | HOUSETYPE_MODE | FLOORSMAX_MEDI | FLOORSMAX_MODE | EMERGENCYSTATE_MODE | OCCUPATION_TYPE | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_YEAR | NAME_TYPE_SUITE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | CNT_FAM_MEMBERS | FLAG_DOCUMENT_8 | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_21 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | CNT_CHILDREN | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_12 | ORGANIZATION_TYPE | NAME_INCOME_TYPE | LIVE_CITY_NOT_WORK_CITY | NAME_CONTRACT_TYPE | REG_CITY_NOT_WORK_CITY | REG_CITY_NOT_LIVE_CITY | LIVE_REGION_NOT_WORK_REGION | REG_REGION_NOT_WORK_REGION | REG_REGION_NOT_LIVE_REGION | HOUR_APPR_PROCESS_START | WEEKDAY_APPR_PROCESS_START | REGION_RATING_CLIENT_W_CITY | REGION_RATING_CLIENT | FLAG_EMAIL | FLAG_PHONE | FLAG_CONT_MOBILE | FLAG_WORK_PHONE | FLAG_EMP_PHONE | FLAG_MOBIL | REGION_POPULATION_RELATIVE | NAME_HOUSING_TYPE | NAME_FAMILY_STATUS | NAME_EDUCATION_TYPE | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | reg oper account | 0.1250 | 0.1250 | NaN | 0.0000 | 0.00 | Stone, brick | 0.0690 | 0.0690 | block of flats | 0.0833 | 0.0833 | No | Laborers | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | Unaccompanied | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0 | M | N | Y | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | Working | 0 | Cash loans | 0 | 0 | 0 | 0 | 0 | 10 | WEDNESDAY | 2 | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 0.018801 | House / apartment | Single / not married | Secondary / secondary special | 1 |
| 100003 | reg oper account | 0.3333 | 0.3333 | NaN | 0.0806 | 0.08 | Block | 0.0345 | 0.0345 | block of flats | 0.2917 | 0.2917 | No | Core staff | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Family | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 0 | F | N | N | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | School | State servant | 0 | Cash loans | 0 | 0 | 0 | 0 | 0 | 11 | MONDAY | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0.003541 | House / apartment | Married | Higher education | 0 |
| 100004 | NaN | NaN | NaN | 26.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | M | Y | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Government | Working | 0 | Revolving loans | 0 | 0 | 0 | 0 | 0 | 9 | MONDAY | 2 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0.010032 | House / apartment | Single / not married | Secondary / secondary special | 0 |
| 100006 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Laborers | NaN | NaN | NaN | NaN | NaN | NaN | Unaccompanied | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0 | F | N | Y | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | Working | 0 | Cash loans | 0 | 0 | 0 | 0 | 0 | 17 | WEDNESDAY | 2 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0.008019 | House / apartment | Civil marriage | Secondary / secondary special | 0 |
| 100007 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Core staff | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Unaccompanied | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | M | N | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Religion | Working | 1 | Cash loans | 1 | 0 | 0 | 0 | 0 | 11 | THURSDAY | 2 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 0.028663 | House / apartment | Single / not married | Secondary / secondary special | 0 |
list_cat_vars
['FONDKAPREMONT_MODE', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI', 'OWN_CAR_AGE', 'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'WALLSMATERIAL_MODE', 'ENTRANCES_MEDI', 'ENTRANCES_MODE', 'HOUSETYPE_MODE', 'FLOORSMAX_MEDI', 'FLOORSMAX_MODE', 'EMERGENCYSTATE_MODE', 'OCCUPATION_TYPE', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'NAME_TYPE_SUITE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'CNT_FAM_MEMBERS', 'FLAG_DOCUMENT_8', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'CNT_CHILDREN', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_12', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE', 'LIVE_CITY_NOT_WORK_CITY', 'NAME_CONTRACT_TYPE', 'REG_CITY_NOT_WORK_CITY', 'REG_CITY_NOT_LIVE_CITY', 'LIVE_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION', 'HOUR_APPR_PROCESS_START', 'WEEKDAY_APPR_PROCESS_START', 'REGION_RATING_CLIENT_W_CITY', 'REGION_RATING_CLIENT', 'FLAG_EMAIL', 'FLAG_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_WORK_PHONE', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'REGION_POPULATION_RELATIVE', 'NAME_HOUSING_TYPE', 'NAME_FAMILY_STATUS', 'NAME_EDUCATION_TYPE', 'TARGET']
other
['COMMONAREA_AVG', 'COMMONAREA_MODE', 'COMMONAREA_MEDI', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAPARTMENTS_AVG', 'LIVINGAPARTMENTS_MODE', 'FLOORSMIN_AVG', 'YEARS_BUILD_AVG', 'YEARS_BUILD_MEDI', 'YEARS_BUILD_MODE', 'LANDAREA_MEDI', 'LANDAREA_AVG', 'LANDAREA_MODE', 'BASEMENTAREA_MEDI', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'EXT_SOURCE_1', 'NONLIVINGAREA_AVG', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_MEDI', 'ELEVATORS_AVG', 'APARTMENTS_AVG', 'APARTMENTS_MEDI', 'APARTMENTS_MODE', 'ENTRANCES_AVG', 'LIVINGAREA_MEDI', 'LIVINGAREA_MODE', 'LIVINGAREA_AVG', 'FLOORSMAX_AVG', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'TOTALAREA_MODE', 'EXT_SOURCE_3', 'EXT_SOURCE_2', 'AMT_GOODS_PRICE', 'AMT_ANNUITY', 'DAYS_LAST_PHONE_CHANGE', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'DAYS_ID_PUBLISH', 'DAYS_REGISTRATION', 'DAYS_EMPLOYED', 'DAYS_BIRTH']
Echamos un vistazo a las listas de variables categóricas y otras.
len(list_cat_vars)
74
len(other)
47
Se observa que hemos obtenido 74 variables categóricas y 47 no categóricas.
pd_loan_filter_null[list_cat_vars].dtypes
FONDKAPREMONT_MODE category FLOORSMIN_MODE category FLOORSMIN_MEDI category OWN_CAR_AGE category ELEVATORS_MODE category ELEVATORS_MEDI category WALLSMATERIAL_MODE category ENTRANCES_MEDI category ENTRANCES_MODE category HOUSETYPE_MODE category FLOORSMAX_MEDI category FLOORSMAX_MODE category EMERGENCYSTATE_MODE category OCCUPATION_TYPE category AMT_REQ_CREDIT_BUREAU_WEEK category AMT_REQ_CREDIT_BUREAU_HOUR category AMT_REQ_CREDIT_BUREAU_MON category AMT_REQ_CREDIT_BUREAU_QRT category AMT_REQ_CREDIT_BUREAU_DAY category AMT_REQ_CREDIT_BUREAU_YEAR category NAME_TYPE_SUITE category DEF_30_CNT_SOCIAL_CIRCLE category OBS_60_CNT_SOCIAL_CIRCLE category OBS_30_CNT_SOCIAL_CIRCLE category DEF_60_CNT_SOCIAL_CIRCLE category CNT_FAM_MEMBERS category FLAG_DOCUMENT_8 category CODE_GENDER category FLAG_OWN_CAR category FLAG_OWN_REALTY category FLAG_DOCUMENT_2 category FLAG_DOCUMENT_3 category FLAG_DOCUMENT_4 category FLAG_DOCUMENT_5 category FLAG_DOCUMENT_6 category FLAG_DOCUMENT_7 category FLAG_DOCUMENT_9 category FLAG_DOCUMENT_21 category FLAG_DOCUMENT_10 category FLAG_DOCUMENT_11 category CNT_CHILDREN category FLAG_DOCUMENT_13 category FLAG_DOCUMENT_14 category FLAG_DOCUMENT_15 category FLAG_DOCUMENT_16 category FLAG_DOCUMENT_17 category FLAG_DOCUMENT_18 category FLAG_DOCUMENT_19 category FLAG_DOCUMENT_20 category FLAG_DOCUMENT_12 category ORGANIZATION_TYPE category NAME_INCOME_TYPE category LIVE_CITY_NOT_WORK_CITY category NAME_CONTRACT_TYPE category REG_CITY_NOT_WORK_CITY category REG_CITY_NOT_LIVE_CITY category LIVE_REGION_NOT_WORK_REGION category REG_REGION_NOT_WORK_REGION category REG_REGION_NOT_LIVE_REGION category HOUR_APPR_PROCESS_START category WEEKDAY_APPR_PROCESS_START category REGION_RATING_CLIENT_W_CITY category REGION_RATING_CLIENT category FLAG_EMAIL category FLAG_PHONE category FLAG_CONT_MOBILE category FLAG_WORK_PHONE category FLAG_EMP_PHONE category FLAG_MOBIL category REGION_POPULATION_RELATIVE category NAME_HOUSING_TYPE category NAME_FAMILY_STATUS category NAME_EDUCATION_TYPE category TARGET category dtype: object
income_counts = pd_loan_filter_null['NAME_INCOME_TYPE'].value_counts()
income_percentages = pd_loan_filter_null['NAME_INCOME_TYPE'].value_counts(normalize=True) # Expreso los porcentajes de manera que podamos apreciar mejor los datos ya que en valor absoluto no somos conscientes del valor que pueden tener.
income_summary = pd.DataFrame({
'Count': income_counts,
'Percentage': income_percentages.round(2)
})
income_summary
| Count | Percentage | |
|---|---|---|
| NAME_INCOME_TYPE | ||
| Working | 158774 | 0.52 |
| Commercial associate | 71617 | 0.23 |
| Pensioner | 55362 | 0.18 |
| State servant | 21703 | 0.07 |
| Unemployed | 22 | 0.00 |
| Student | 18 | 0.00 |
| Businessman | 10 | 0.00 |
| Maternity leave | 5 | 0.00 |
plt.figure(figsize=(12, 6))
income_counts.plot(kind='bar', color='skyblue', edgecolor='black')
for i, (count, percentage) in enumerate(zip(income_counts, income_percentages)):
plt.text(i, count + 1000, f'{percentage:.2f}', ha='center', fontsize=10)
plt.title('Distribución de NAME_INCOME_TYPE', fontsize=14)
plt.ylabel('Frecuencia', fontsize=12)
plt.xlabel('NAME_INCOME_TYPE', fontsize=12)
plt.xticks(rotation=45)
plt.show()
contract_type_counts = pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts()
contract_type_percentage = pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts(normalize=True)
contract_type_summary = pd.DataFrame({
'Count': contract_type_counts,
'Percentage': contract_type_percentage
})
contract_type_summary
| Count | Percentage | |
|---|---|---|
| NAME_CONTRACT_TYPE | ||
| Cash loans | 278232 | 0.904787 |
| Revolving loans | 29279 | 0.095213 |
plt.figure(figsize=(12, 6))
contract_type_counts.plot(kind='bar', color='orange', edgecolor='black')
for i, (count, percentage) in enumerate(zip(contract_type_counts, contract_type_percentage)):
plt.text(i, count + 1000, f'{percentage:.2f}', ha='center', fontsize=10)
plt.title('Distribución de NAME_CONTRACT_TYPE', fontsize=14)
plt.ylabel('Frecuencia', fontsize=12)
plt.xlabel('NAME_CONTRACT_TYPE', fontsize=12)
plt.xticks(rotation=45)
plt.show()
Se observa que la mayoría de los préstamos son otorgados a trabajadores (más del 50%), seguido de comerciales asociados (23%) mientras que grupos como estudiantes, desempleados o empresarios son mucho menos frecuentes.
Por otro lado, los resultados de la celda 27 muestran que la mayoría de los préstamos en el dataset son "Cash loans" (préstamos en efectivo), con 278.232 registros (90%), mientras que los "Revolving loans" (préstamos rotativos) son mucho menos comunes, con 29.279 registros (10%). Esto sugiere que los clientes del dataset suelen preferir préstamos tradicionales con cantidades fijas y plazos definidos, mientras que el uso de créditos flexibles es significativamente menor.
pd_loan_filter_null[other].head(10)
| COMMONAREA_AVG | COMMONAREA_MODE | COMMONAREA_MEDI | NONLIVINGAPARTMENTS_AVG | NONLIVINGAPARTMENTS_MODE | NONLIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAPARTMENTS_AVG | LIVINGAPARTMENTS_MODE | FLOORSMIN_AVG | YEARS_BUILD_AVG | YEARS_BUILD_MEDI | YEARS_BUILD_MODE | LANDAREA_MEDI | LANDAREA_AVG | LANDAREA_MODE | BASEMENTAREA_MEDI | BASEMENTAREA_AVG | BASEMENTAREA_MODE | EXT_SOURCE_1 | NONLIVINGAREA_AVG | NONLIVINGAREA_MODE | NONLIVINGAREA_MEDI | ELEVATORS_AVG | APARTMENTS_AVG | APARTMENTS_MEDI | APARTMENTS_MODE | ENTRANCES_AVG | LIVINGAREA_MEDI | LIVINGAREA_MODE | LIVINGAREA_AVG | FLOORSMAX_AVG | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BEGINEXPLUATATION_MODE | YEARS_BEGINEXPLUATATION_AVG | TOTALAREA_MODE | EXT_SOURCE_3 | EXT_SOURCE_2 | AMT_GOODS_PRICE | AMT_ANNUITY | DAYS_LAST_PHONE_CHANGE | AMT_INCOME_TOTAL | AMT_CREDIT | DAYS_ID_PUBLISH | DAYS_REGISTRATION | DAYS_EMPLOYED | DAYS_BIRTH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | |||||||||||||||||||||||||||||||||||||||||||||||
| 100002 | 0.0143 | 0.0144 | 0.0144 | 0.0000 | 0.0 | 0.0000 | 0.0205 | 0.0202 | 0.022 | 0.1250 | 0.6192 | 0.6243 | 0.6341 | 0.0375 | 0.0369 | 0.0377 | 0.0369 | 0.0369 | 0.0383 | 0.083037 | 0.0000 | 0.0 | 0.00 | 0.00 | 0.0247 | 0.0250 | 0.0252 | 0.0690 | 0.0193 | 0.0198 | 0.0190 | 0.0833 | 0.9722 | 0.9722 | 0.9722 | 0.0149 | 0.139376 | 0.262949 | 351000.0 | 24700.5 | -1134.0 | 202500.0 | 406597.5 | -2120 | -3648.0 | -637 | -9461 |
| 100003 | 0.0605 | 0.0497 | 0.0608 | 0.0039 | 0.0 | 0.0039 | 0.0787 | 0.0773 | 0.079 | 0.3333 | 0.7960 | 0.7987 | 0.8040 | 0.0132 | 0.0130 | 0.0128 | 0.0529 | 0.0529 | 0.0538 | 0.311267 | 0.0098 | 0.0 | 0.01 | 0.08 | 0.0959 | 0.0968 | 0.0924 | 0.0345 | 0.0558 | 0.0554 | 0.0549 | 0.2917 | 0.9851 | 0.9851 | 0.9851 | 0.0714 | NaN | 0.622246 | 1129500.0 | 35698.5 | -828.0 | 270000.0 | 1293502.5 | -291 | -1186.0 | -1188 | -16765 |
| 100004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.729567 | 0.555912 | 135000.0 | 6750.0 | -815.0 | 67500.0 | 135000.0 | -2531 | -4260.0 | -225 | -19046 |
| 100006 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.650442 | 297000.0 | 29686.5 | -617.0 | 135000.0 | 312682.5 | -2437 | -9833.0 | -3039 | -19005 |
| 100007 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.322738 | 513000.0 | 21865.5 | -1106.0 | 121500.0 | 513000.0 | -3458 | -4311.0 | -3038 | -19932 |
| 100008 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.621226 | 0.354225 | 454500.0 | 27517.5 | -2536.0 | 99000.0 | 490495.5 | -477 | -4970.0 | -1588 | -16941 |
| 100009 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.774761 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.492060 | 0.724000 | 1395000.0 | 41301.0 | -1562.0 | 171000.0 | 1560726.0 | -619 | -1213.0 | -3130 | -13778 |
| 100010 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.540654 | 0.714279 | 1530000.0 | 42075.0 | -1070.0 | 360000.0 | 1530000.0 | -2379 | -4597.0 | -449 | -18850 |
| 100011 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.587334 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.751724 | 0.205747 | 913500.0 | 33826.5 | 0.0 | 112500.0 | 1019610.0 | -3514 | -7427.0 | 365243 | -20099 |
| 100012 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.746644 | 405000.0 | 20250.0 | -1673.0 | 135000.0 | 405000.0 | -3992 | -14437.0 | -2019 | -14469 |
print(pd_loan_filter_null["OCCUPATION_TYPE"].value_counts().count())
pd_loan_filter_null["OCCUPATION_TYPE"]\
.apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)
18
OCCUPATION_TYPE laborers 0.261396 sales staff 0.152056 core staff 0.130589 managers 0.101227 drivers 0.088116 high skill tech staff 0.053903 accountants 0.046481 medicine staff 0.040437 security staff 0.031835 cooking staff 0.028164 cleaning staff 0.022040 private service staff 0.012562 low-skill laborers 0.009914 waiters/barmen staff 0.006385 secretaries 0.006181 realty agents 0.003557 hr staff 0.002667 it staff 0.002491 Name: proportion, dtype: float64
top_occupation = pd_loan_filter_null['OCCUPATION_TYPE'].value_counts().head(10)
top_occupation.plot(kind='bar')
<Axes: xlabel='OCCUPATION_TYPE'>
La tabla muestra la distribución proporcional de los diferentes tipos de ocupaciones en el dataset. Los "laborers" son el grupo más representado, con un 26.14% del total, seguidos por "sales staff" (15.26%) y "core staff" (13.05%). En contraste, ocupaciones como "IT staff" (0.25%) y "HR staff" (0.35%) tienen la menor representación.
Exportación de la tabla¶
pd_loan_filter_null.to_csv('./pd_data_initial_preprocessing.csv')